Predicting Diabetes

Demi van den Biggelaar (9660089)
August Gesthuizen (5292565)
Friso Harff (7526946)
Leander van der Waal (7180063)

2026-01-15

Table of Contents - Leander

  • Research Question
  • The Data
  • Models
  • The Best Model
  • Assumptions
  • Conclusion model
  • Answer Research Question

Introduction Research Question

  • Diabetes

  • Common

  • Diagnosis

The Research Question

Can we predict whether someone is diabetic from research data using logistic regression?

The Dataset

National Institute of Diabetes and Digestive and Kidney Diseases (1990)

Exploring the data

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

  • BloodPressure: Diastolic blood pressure (mm Hg)

  • SkinThickness: Triceps skin fold thickness (mm)

  • Insulin: 2-Hour serum insulin (mu U/ml)

  • BMI: Body mass index (weight in kg/(height in m)^2)

  • DiabetesPedigreeFunction: a function that scores the probability of diabetes based on family history, with a realistic range of 0.08 to 2.42

  • Age: Age (years)

  • Outcome: Class variable (0 = no diabetes, 1 = diabetes)

Preprocessing

# Removing missing values (recorded as 0).
# This must run before the recoding: otherwise BMI == 0 would be
# recoded to class 1 and slip through the BMI != 0 filter.
dataRDS <- subset(dataRDS,
                  BloodPressure != 0 &
                  SkinThickness != 0 &
                  Glucose != 0 &
                  Insulin != 0 &
                  BMI != 0)

# Recoding BMI to classes:
# 1 = Underweight (<= 18.5), 2 = Healthy (18.5-25], 3 = Overweight (25-30], 4 = Obese (> 30)
dataRDS$BMI <- cut(dataRDS$BMI,
                   breaks = c(-Inf, 18.5, 25, 30, Inf),
                   labels = FALSE)

Interesting Findings

259 of 393 participants (66%) are obese

Choosing and explaining the best model (explain what the best model is, what it does, and why it is the best) - August

Note: do not forget to mention the moderation effects, and consider linking them back to the pairs table above. The models can also be adjusted to make the story clearer.

Binary dependent variable

  • Diabetes (coded 1)
  • No diabetes (coded 0)

Sufficiently large sample size

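  • 10 cases per candidate predictor: \(N=\frac{10k}{p}\), with k the number of candidate predictors and p the proportion of the least frequent outcome

  Outcome   n      prop
1       0 263 0.6692112
2       1 130 0.3307888

  • k=4, p=0.33 -> N = 121 required

  • 393 observations available, 130 of them positive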
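
The sample size check can be reproduced in a few lines of R. This is a minimal sketch: the class counts are copied from the table above, and the variable names (`n_outcome`, `n_required`) are only illustrative.

```r
# Rule of thumb: N = 10 * k / p
k <- 4                                   # number of candidate predictors
n_outcome <- c(`0` = 263, `1` = 130)     # class counts from the table above
p <- min(n_outcome) / sum(n_outcome)     # proportion of the least frequent outcome

n_required <- ceiling(10 * k / p)
n_required                               # 121

# TRUE when the available sample is large enough
sum(n_outcome) >= n_required
```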

Full-rank predictor matrix

  • More observations than predictors

  • No multicollinearity among linear predictors

                     Age                  Glucose                      BMI 
                1.022157                 1.018869                 1.009314 
DiabetesPedigreeFunction 
                1.006276 
  • VIF ≈ 1 for every predictor: no multicollinearity
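
To show what a VIF measures, the sketch below computes it by hand in base R as \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing predictor j on the other predictors. The data are simulated stand-ins, not the diabetes data, and this is not necessarily the function that produced the output above.

```r
set.seed(1)
n <- 200

# Simulated stand-ins for the four predictors (assumption: NOT the real data)
X <- data.frame(
  Age = rnorm(n, 30, 10),
  Glucose = rnorm(n, 120, 30),
  BMI = rnorm(n, 32, 7),
  DiabetesPedigreeFunction = runif(n, 0.08, 2.42)
)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 3)
```

Because the simulated predictors are drawn independently, the values should land close to 1, which is the same pattern as in the output above.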

No influential values or outliers

  • No observations fall outside the Cook’s distance contours

  • No influential observations
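
A check like this could look as follows in R. This is a sketch on simulated data, with an assumed cut-off of 1 for Cook's distance; other rules of thumb, such as 4/n, are also in use, and the cut-off used for the slides is not stated.

```r
set.seed(2)
n <- 200

# Simulated data (assumption: NOT the diabetes data)
sim <- data.frame(Glucose = rnorm(n, 120, 30),
                  Age = rnorm(n, 30, 10))
sim$Outcome <- rbinom(n, 1, plogis(-6 + 0.04 * sim$Glucose + 0.02 * sim$Age))

fit <- glm(Outcome ~ Glucose + Age, data = sim, family = binomial)

# Cook's distance for every observation
cd <- cooks.distance(fit)

threshold <- 1                 # assumed rule-of-thumb cut-off
influential <- which(cd > threshold)
length(influential)            # number of flagged observations
```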

Deviance Residuals?


Exploring the model: discuss the model using the research method from the slides, including the confusion matrix - Demi
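
As a starting point, a confusion matrix for a logistic regression can be built like this. It is a minimal sketch on simulated data with a single predictor and an assumed classification cut-off of 0.5; it is not the final model of this presentation.

```r
set.seed(3)
n <- 300

# Simulated data (assumption: NOT the diabetes data)
sim <- data.frame(Glucose = rnorm(n, 120, 30))
sim$Outcome <- rbinom(n, 1, plogis(-5 + 0.04 * sim$Glucose))

fit <- glm(Outcome ~ Glucose, data = sim, family = binomial)

# Predicted probabilities, classified with an assumed cut-off of 0.5
prob <- predict(fit, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))
obs  <- factor(sim$Outcome, levels = c(0, 1))

# Rows: predicted class, columns: observed class
conf_mat <- table(Predicted = pred, Observed = obs)
conf_mat

# Proportion of correctly classified observations
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
```

Fixing the factor levels to 0 and 1 keeps the table 2 x 2 even if the model never predicts one of the classes.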


Conclude and answer - Demi
